1 Introduction

Mesenchymal stromal cells (MSCs) are a subset non-hematopoietic adult stem cells that originate from the mesoderm. They have the capacity for self-renewal and differentiation into osteoblasts, adipocytes and chondrocytes making them important for the development of cell-based therapies in regenerative medicine. However, populations of MSCs are heterogeneous with respect to their differentiation capacity (Costa et al. 2020; Phinney 2012) and some exhibit immundomodulatory properties unhelpful for cell-based therapies. Understanding subtype heterogeneity is key for the development of efficacious therapies. The Genever Group has developed a model of this heterogeneity using has a number of immortalised clonal MSC lines which comprises five subtypes (Stone et al. 2019; Kay et al. 2019). The phenotypes of these subtypes is as follows:

The data are mass spectrometry data of the soluble protein fraction from five immortalised mesenchymal stromal cell (MSC) lines.

2 Methods

2.1 Data description

There are two data files: Y101_Y102_Y201_Y202_Y101-5.csv and comparison_p_and_q.csv

The data in Y101_Y102_Y201_Y202_Y101-5.csv are normalised protein abundances. Each row is a protein and there are three replicates for each subtype which are arranged in columns. Also in the file are columns for:

  • the protein accessions: the top hit is first, Human sequences are prefixed with 1::, bovine with 2::
  • the number of peptides used to identify the protein
  • the number of unique peptides used to identify the protein
  • a measure of confidence in that identification
  • the maximum fold change between the mean abundances of two cell lines (i.e., highest mean / lowest mean)
  • a \(p\) value for a one-way ANOVA for the effect of cell line
  • \(q\) the false discovery rate, a \(p\) value corrected by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995).
  • a measure of the power of that test
  • the cell line with the highest mean
  • the cell line with the lowest mean
  • the protein mass
  • a description of the protein
  • a binary indicator variable for whether at least two peptides were detected for a protein.

The data in comparison_p_and_q.csv give the p and q values for pairwise comparisons between cell line.

2.2 Data preparation

I used R (R Core Team 2019) with tidyverse packages (Wickham 2017) to import and process the raw data. Bovine serum proteins from the medium on which the cells were grown and those for which fewer than two peptides were detected were filtered out. Abundances were summarised for each line and for each line-protein combination and log fold change between cell lines for each protein were calculated. See Equation (2.1). Protein sets were visualised with the VennDiagram package (Chen 2018).

\[\begin{equation} FC_{jj'} = log_{2}\left(\frac{mean_{j}}{mean_{j'}}\right) \tag{2.1} \end{equation}\]

where:

  • \(mean_{j}\) is the mean protein abundance for a given protein in cell line \(j\)
  • \(mean_{j'}\) is the mean protein abundance for a given protein in another cell line \(j'\)
  • such that \(FC_{jj'}\) is the \(log_{2}\) fold change between line \(j\) and line \(j'\) for a given gene.

2.3 Visualisation with Principal Components Analysis

Principal Components Analysis was used for dimensionality-reduction to allow visualisation the distance between samples. It was conducted on protein abundances scaled to unit variance to prevent highly abundant proteins dominating the analysis. The package GGally (Schloerke et al. 2020) was used to produced pairwise scatter plots of the first six principal components.

2.4 Clustering and heatmaps

Hierarchical clustering was conducted and heatmaps produced with the package heatmaply (Galili et al. 2017) with interactivity provided by plotly (Sievert 2020). Hierarchical clustering used the complete linkage method and was performed for both proteins and samples on the proteins that differed significantly between at least two lines. Significance was determined at the \(q < 0.01\) level where \(q\) is the False Discovery Rate (FDR) adjusted \(p\)-value. FDR is a method of correcting for multiple comparisons by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995). The abundances were “\(z\)-score normalised” for each protein before clustering and heatmapping. \(Z\)-score normalisation is way to make comparisons between, and visualisations of, abundances on different scales. See Equation (2.2).

\[\begin{equation} z_{ijk} = \frac{x_{ijk} - \bar{x_{ij}}}{sd_{ij}} \tag{2.2} \end{equation}\]

where:

  • \(x_{ijk}\) is the abundance for protein \(i\), cell line \(j\) and replicate \(k\)
  • \(\bar{x_{ij}}\) is the mean abundance for protein \(i\) and cell line \(j\) across replicates
  • \(sd_{ij}\) is the standard deviation on the abundance for protein \(i\) and cell line \(j\) across replicates
  • such that \(z_{ijk}\) is the \(z\)-score normalised abundance for protein \(i\), cell line \(j\) and replicate \(k\)

2.5 Volcano plots

3 Results

There were 861 human protein identified from more than two peptides All 861 were found in all five cells line with two exceptions: DKK1 was not seen in Y201 and PAPPA2 was not seen in Y202. These are likely to be missing by chance from all three replicates rather than biologically meaningful. See figure 3.1. The total soluble protein content of each cell line was broadly similar. See figure 3.2.

Protein set shared by cells lines. Created with package VennDiagram (Chen 2018)

Figure 3.1: Protein set shared by cells lines. Created with package VennDiagram (Chen 2018)

Mean soluble protein content for each cell line. Error bars are \(\bar{x} \pm 1 s.e.\)

Figure 3.2: Mean soluble protein content for each cell line. Error bars are \(\bar{x} \pm 1 s.e.\)

3.1 Principal Components analysis

The first Principal Component captured 30.6% of the variation in soluble protein expression between samples and the first six captured 81.6%. The distributions of scores on PC1 for each cell lineage show good separation between Y1015 and the other cell lines. The two immunomodulatory lines, Y102 and Y202 are the least easy to separate on a single component. Pairwise scatter plots of PC1 to PC6 show greater, but still imperfect, separation of lines. See figure 3.3.

Samples represented on the first six Principal Components capturing 81.6% of the variation in soluble protein expression between samples. The leading diagonal shows the distribution of scores on each component with pairwise scatter plots of Principal Components below the diagonal. Created with the GGally (Schloerke et al. 2020) and patchwork (Pedersen 2020) packages.

Figure 3.3: Samples represented on the first six Principal Components capturing 81.6% of the variation in soluble protein expression between samples. The leading diagonal shows the distribution of scores on each component with pairwise scatter plots of Principal Components below the diagonal. Created with the GGally (Schloerke et al. 2020) and patchwork (Pedersen 2020) packages.

3.2 Clustering and heatmaps

There were 327 proteins with \(p < 0.05\) and 190 proteins with \(p < 0.01\) but 861 tests implies 43.05 and 8.61 respectively are false positives. The FDR adjusted \(p\)-value (or \(q\)-value) of 0.01 implies that 1% of significant tests will result in false positives. The abundance of 335 proteins differed significantly between at least two cell lines at the \(q < 0.05\) level and between 127 proteins at the \(q < 0.01\) level. This implies only 16.75 and 1.27 false positives.

The distribution of \(q\)-values was right skewed (see Figure 3.4) with a mean of 0.0917 and a median of 0.0745. This what we would expect to see.

Distribution of \(q\)-values (\(p\)-values corrected for multiple comparisons by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995)).

Figure 3.4: Distribution of \(q\)-values (\(p\)-values corrected for multiple comparisons by the Benjamini–Hochberg procedure (Benjamini and Hochberg 1995)).

4 to be completed

Cell line replicates clustered by protein expression. These clustered corresponded The two immunomodulatory (see Figure 4.1) foorm one clus

Figure 4.1: my lovely heatmap

4.1 Volcano plots

4.2 Word count

Word count calculated with wordcountaddin (Marwick 2020). The session information was written into a separate file rather than to the README this has been added to the wordcount.

This document: 1278
README: 109
Session info: 467
Total: 1854

References

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” J. R. Stat. Soc. Series B Stat. Methodol. 57 (1): 289–300.
Chen, Hanbo. 2018. VennDiagram: Generate High-Resolution Venn and Euler Plots. https://CRAN.R-project.org/package=VennDiagram.
Costa, Luis A, Noemi Eiro, Marı́a Fraile, Luis O Gonzalez, Jorge Saá, Pablo Garcia-Portabella, Belén Vega, José Schneider, and Francisco J Vizoso. 2020. “Functional Heterogeneity of Mesenchymal Stem Cells from Natural Niches to Culture Conditions: Implications for Further Clinical Uses.” Cell. Mol. Life Sci., July.
Galili, Tal, O’Callaghan, Alan, Sidi, Jonathan, Sievert, and Carson. 2017. “Heatmaply: An r Package for Creating Interactive Cluster Heatmaps for Online Publishing.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btx657.
Kay, Alasdair, Alice Carstairs, Andrew Stone, Emma Rand, and Paul G Genever. 2019. “Altered Lipid/Cholesterol Metabolism in Mesenchymal Stromal Cells Following Low-Serum Adaptation Accelerates Osteogenesis in Vitro and in Vivo by Modifying Secretome Compartments.” In ResearchGate.
Marwick, Ben. 2020. Wordcountaddin: Word Counts and Readability Statistics in r Markdown Documents.
Pedersen, Thomas Lin. 2020. Patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
Phinney, Donald G. 2012. “Functional Heterogeneity of Mesenchymal Stem Cells: Implications for Cell Therapy.” J. Cell. Biochem. 113 (9): 2806–12.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2020. GGally: Extension to ’ggplot2’. https://CRAN.R-project.org/package=GGally.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.
Stone, Andrew, Rachel Crossland, Emma Rand, Alasdair Kay, Xiao-Nong Wang, Ian Hitchcock, and Paul Genever. 2019. “The Secretome of Mesenchymal Stromal Cells Drives Functional Heterogeneity.” Cardiff, UK: Joint Annual Meeting of the Bone Research Society and the British Orthopaedic Research Society.
Wickham, Hadley. 2017. Tidyverse: Easily Install and Load the ’tidyverse’. https://CRAN.R-project.org/package=tidyverse.